DOC: update pandas.DataFrame.boxplot docstring. Fixes #8847 #20152

Contributor

mabelvj commented Mar 10, 2018

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py pandas.DataFrame.boxplot
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single pandas.DataFrame.boxplot
  • It has been proofread on language by another sprint participant
################################################################################
##################### Docstring (pandas.DataFrame.boxplot) #####################
################################################################################

Make a box plot from DataFrame columns.

Make a box-and-whisker plot from DataFrame columns optionally grouped
by some other columns. A box plot is a method for graphically depicting
groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data,
with a line at the median (Q2). The whiskers extend from the edges
of the box to show the range of the data. The position of the whiskers
is set by default to 1.5*IQR (IQR = Q3 - Q1) from the edges of the box.
Outlier points are those past the end of the whiskers.

For further details see
Wikipedia's entry for `boxplot <https://en.wikipedia.org/wiki/Box_plot>`_.

Parameters
----------
column : str or list of str, optional
    Column name or list of names, or vector.
    Can be any valid input to groupby.
by : str or array-like
    Column in the DataFrame to groupby.
ax : object of class matplotlib.axes.Axes, default `None`
    The matplotlib axes to be used by boxplot.
fontsize : float or str
    Tick label font size in points or as a string (e.g., 'large')
    (see `matplotlib.axes.Axes.tick_params
    <https://matplotlib.org/api/_as_gen/
    matplotlib.axes.Axes.tick_params.html>`_).
rot : int or float, default 0
    The rotation angle of labels (in degrees)
    with respect to the screen coordinate system.
grid : boolean, default `True`
    Setting this to True will show the grid.
figsize : A tuple (width, height) in inches
    The size of the figure to create in matplotlib.
layout : tuple (rows, columns) (optional)
    For example, (3, 5) will display the subplots
    using 3 rows and 5 columns, starting from the top-left.
return_type : {None, 'axes', 'dict', 'both'}, default 'axes'
    The kind of object to return. The default is ``axes``.

    * 'axes' returns the matplotlib axes the boxplot is drawn on.
    * 'dict' returns a dictionary whose values are the matplotlib
      Lines of the boxplot.
    * 'both' returns a namedtuple with the axes and dict.
    * when grouping with ``by``, a Series mapping columns to
      ``return_type`` is returned (i.e.
      ``df.boxplot(column=['Col1','Col2'], by='var',return_type='axes')``
      may return ``Series([AxesSubplot(..),AxesSubplot(..)],
      index=['Col1','Col2'])``).

      If ``return_type`` is `None`, a NumPy array
      of axes with the same shape as ``layout`` is returned
      (i.e. ``df.boxplot(column=['Col1','Col2'],
      by='var',return_type=None)`` may return a
      ``array([<matplotlib.axes._subplots.AxesSubplot object at ..>,
      <matplotlib.axes._subplots.AxesSubplot object at ..>],
      dtype=object)``).
**kwds : Keyword Arguments (optional)
    All other plotting keyword arguments to be passed to
    `matplotlib.pyplot.boxplot <https://matplotlib.org/api/_as_gen/
    matplotlib.pyplot.boxplot.html#matplotlib.pyplot.boxplot>`_.

Returns
-------
result:
    Options:

    * ax : object of class
      matplotlib.axes.Axes (for ``return_type='axes'``)
    * lines : dict (for ``return_type='dict'``)
    * (ax, lines): namedtuple (for ``return_type='both'``)
    * :class:`~pandas.Series` (for ``return_type != None``
      and data grouped with ``by``)
    * :class:`~numpy.array` (for ``return_type=None``
      and data grouped with ``by``)

See Also
--------
matplotlib.pyplot.boxplot: Make a box and whisker plot.
matplotlib.pyplot.hist: Make a histogram.

Notes
-----
Use ``return_type='dict'`` when you want to tweak the appearance
of the lines after plotting. In this case a dict containing the Lines
making up the boxes, caps, fliers, medians, and whiskers is returned.

Examples
--------

Boxplots can be created for every column in the dataframe
by ``df.boxplot()`` or indicating the columns to be used:

.. plot::
    :context: close-figs

    >>> np.random.seed(1234)
    >>> df = pd.DataFrame(np.random.rand(10,4),
    ...                   columns=['Col1', 'Col2', 'Col3', 'Col4'])
    >>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])

Boxplots of variable distributions grouped by the values of a third variable
can be created using the option ``by``. For instance:

.. plot::
    :context: close-figs

    >>> df = pd.DataFrame(np.random.rand(10,2), columns=['Col1', 'Col2'])
    >>> df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
    >>> boxplot = df.boxplot(by='X')

A list of strings (e.g. ``['X','Y']``) can be passed to boxplot
in order to group the data by a combination of the variables on the x-axis:

.. plot::
    :context: close-figs

    >>> df = pd.DataFrame(np.random.rand(10,3),
    ...                   columns=['Col1', 'Col2', 'Col3'])
    >>> df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
    >>> df['Y'] = pd.Series(['A','B','A','B','A','B','A','B','A','B'])
    >>> boxplot = df.boxplot(column=['Col1','Col2'], by=['X','Y'])

The layout of the boxplot can be adjusted by giving a tuple to ``layout``:

.. plot::
    :context: close-figs

    >>> df = pd.DataFrame(np.random.rand(10,2), columns=['Col1', 'Col2'])
    >>> df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
    >>> boxplot = df.boxplot(by='X', layout=(2,1))

Additional formatting can be done to the boxplot, like suppressing the grid
(``grid=False``), rotating the labels on the x-axis (e.g. ``rot=45``)
or changing the fontsize (e.g. ``fontsize=15``):

.. plot::
    :context: close-figs

    >>> boxplot = df.boxplot(grid=False, rot=45, fontsize=15)

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	Errors in parameters section
		Parameters {'kwds'} not documented
		Unknown parameters {'**kwds'}

[Four screenshots of the rendered docstring, captured 2018-03-12]

@EliosMolina


Parameters
----------
data : the pandas object holding the data
column : column name or list of names, or vector

The Parameter types do not follow the docstring guide.

For example, I suggest that column should be something like:

column : str or list of str, optional
    Column name or list of names, or vector. Can be any valid input to groupby.

And it applies to the other params as well.

Contributor Author

Thanks for the suggestion!

Contributor

@dukebody left a comment

There are some important things about explaining arguments and format that should be worked on.

@@ -1935,52 +1935,134 @@ def plot_series(data, kind='line', ax=None, # Series unique


_shared_docs['boxplot'] = """
Make a box plot from DataFrame column optionally grouped by some columns or
Contributor

The first line should be a short summary not including any details. See https://python-sprints.github.io/pandas/guide/pandas_docstring.html#section-1-short-summary.

@@ -1935,52 +1935,134 @@ def plot_series(data, kind='line', ax=None, # Series unique


_shared_docs['boxplot'] = """
Make a box plot from DataFrame column optionally grouped by some columns or
other inputs
Make a box-and-whisker plot from DataFrame column optionally grouped
Contributor

Perhaps you are focusing too much on the shape of a boxplot and how to interpret it. According to the guide:

The extended summary should provide details on why the function is useful and their use cases, if it is not too generic

So I believe it would be better to put more focus on this.

Contributor

Sorry, now I read #8847 and I see we wanted to add more context about the components of a boxplot and what they mean. So it's OK to leave this.

Maybe add something about the use case for this graph:

A box plot shows the distribution of data with respect to a given variable.

Before the explanation of the figure.

quartile values of the data, with a line at the median (Q2).
The whiskers extend from the edges of box to show the range of the data.
Flier points (outliers) are those past the end of the whiskers.
The position of the whiskers is set by default to 1.5 IQR (`whis=1.5``)
Contributor

"`whis=1.5``" is not formatted correctly, you should have double backquotes also at the start.

Anyhow I think the meaning of whis and its default should be explained in the Parameters section.

Contributor Author

ok, I will leave it because it is not defined in the pandas function, but as a matplotlib parameter.

Contributor

OK, then as you say it's probably not good to write about this parameter here, since it will be confusing for the reader.
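
To illustrate the point above, here is a minimal sketch of `whis` reaching matplotlib through `**kwds` rather than being a named pandas parameter. The data and the names `default_lines` and `wide_lines` are made up for this example, and the `Agg` backend is only selected so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, chosen only for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: nine evenly spaced values plus one extreme point.
df = pd.DataFrame(
    {"Col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]}
)

# ``whis`` is not a named parameter of DataFrame.boxplot; it is forwarded
# to matplotlib via **kwds. 1.5 is matplotlib's default.
default_lines = df.boxplot(whis=1.5, return_type="dict")
plt.close("all")

# A much larger multiple of the IQR brings the extreme point inside the whiskers.
wide_lines = df.boxplot(whis=30, return_type="dict")
plt.close("all")

# Count the points drawn as fliers (outliers) in each case.
n_fliers_default = len(default_lines["fliers"][0].get_ydata())
n_fliers_wide = len(wide_lines["fliers"][0].get_ydata())
print(n_fliers_default, n_fliers_wide)
```

Since the parameter belongs to matplotlib, its full description lives in the matplotlib documentation, which is consistent with leaving it out of the pandas Parameters section.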


Parameters
----------
data : the pandas object holding the data
Contributor

You are probably right about removing this argument, because even if it's present in the original function, this function is never used alone, but always inside a Series or DataFrame, where data is self.

Contributor Author

ok, we thought the same, that it is always called from a pandas object instantiating matplotlib. We checked with other functions (like pandas.plot.line) and data is not included as a parameter in the documentation. Maybe this explains why the validation test failed.

Contributor

No problem, the validation script is a linter, but we humans are more intelligent than linters and know when to not take into account their output. :)

Column in the DataFrame to group by
ax : Matplotlib axes object, optional
Column in the DataFrame to groupby.
ax : Matplotlib axes object, (default `None`)
Contributor

According to the guide, you should not use parentheses around default values. Use Matplotlib axes object, default None. Also it would be nice to explain what None means here - create a new plot.

Member

I think it'd be better that the type is the Python type (meaning matplotlib.pyplot.axis if that's the right one).

Contributor Author

It's the axes returned by gca, matplotlib.axes.Axes.

Contributor Author

It comes from matplotlib.figure.Figure.gca(), so probably matplotlib.axes.Axes, right?
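
A small sketch of what this means in practice, assuming a headless `Agg` backend and made-up column names: an `Axes` created with `plt.subplots()` can be passed as `ax`, and with the default `return_type` the same object should come back.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, chosen only for this sketch
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
rng = np.random.RandomState(1234)
df = pd.DataFrame(rng.randn(10, 2), columns=["Col1", "Col2"])

# Create the axes explicitly and hand them to boxplot.
fig, ax = plt.subplots()
returned = df.boxplot(ax=ax)

# The object pandas draws on is a matplotlib.axes.Axes instance.
print(isinstance(returned, Axes), returned is ax)
```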

>>> np.random.seed(1234)

>>> df = pd.DataFrame({
... u'stratifying_var': np.random.uniform(0, 100, 20),
Contributor

The docs guide advises avoiding random data. In this case I believe it's justifiable because there is a relation between random distributions and boxplots and this can be shown in examples.

However please explain in plain English that you are creating a dataframe with points following a uniform and a normal distribution, respectively, in the different columns. You can also probably find two variables that are naturally uniform and normal, respectively.

Contributor

@lkxz See this ^^ . @datapythonista what do you think? Is it OK to use random data with a seed sometimes in examples if it makes sense?

Member

I agree, feel free to update the documentation too.

Contributor

BTW do we want to use the unicode mark u'' at all in documentation examples? I believe we should always write strings without u or b unless necessary. This way it will be a bytestring in Python 2 and a text (unicode) in Python 3.

... u'price': np.random.normal(100, 5, 20),
... u'demand': np.random.normal(100, 10, 20)})

>>> df[u'quartiles'] = pd.qcut(
Contributor

I believe this is making the example more complicated than needed. The user needs to understand what the qcut method does to understand this example. Can you change the example to not use this additional function?

18 77.282662 100.623565 103.540203 50-75%%
19 88.264119 98.386026 99.644870 75-100%%

To plot the boxplot of the ``demand`` just put:
Contributor

  1. Better create a boxplot. Repeating plot here is a bit redundant with boxplot. :)
  2. Avoid words such as "just", "simply", etc.
  3. Here you are creating a boxplot of a column grouped by another one. Probably you want to create a boxplot without any grouping first to show the basic functionality.

Contributor Author

We found the example in the pandas' cookbook. We also thought it was complicated. Found another example easier in pandas' visualization guide. We're working on changing it.

Contributor

Yep, perhaps boxplot = df.boxplot(column='demand') is good enough.


>>> boxplot = df.boxplot(column=u'demand', by=u'quartiles')

Use ``grid=False`` to hide the grid:
Contributor

Just changing this argument doesn't look like a lot of changes to justify an example. Perhaps we can change multiple parameters at the same time to show a more interesting example.
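
One possible way to combine several formatting arguments in a single call, sketched with made-up data (the seed, the column names, and the `Agg` backend are assumptions for the example, not part of the PR):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, chosen only for this sketch
import numpy as np
import pandas as pd

# Hypothetical data drawn from a normal distribution.
rng = np.random.RandomState(1234)
df = pd.DataFrame(rng.randn(10, 2), columns=["Col1", "Col2"])

# Hide the grid, rotate the x tick labels and enlarge the tick font at once.
ax = df.boxplot(grid=False, rot=45, fontsize=15)

# Inspect the rotation that was applied to the x tick labels.
tick_rotations = [label.get_rotation() for label in ax.get_xticklabels()]
print(tick_rotations)
```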


>>> boxplot = df.boxplot(column=u'demand', by=u'quartiles', grid=False)

Optionally, the layout can be changed by setting ``layout=(rows, cols)``:
Contributor

Yep but here you are playing with figsize as well, it's better if we mention that.
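
As a rough sketch of how `layout` and `figsize` interact (the data, the grouping column `X`, and the sizes are invented for illustration; the `Agg` backend is only there so it runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, chosen only for this sketch
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns and one grouping column.
rng = np.random.RandomState(1234)
df = pd.DataFrame(rng.randn(10, 2), columns=["Col1", "Col2"])
df["X"] = ["A"] * 5 + ["B"] * 5

# layout is (rows, columns): two rows, one column of subplots.
# figsize is (width, height) in inches for the whole figure.
axes = df.boxplot(by="X", layout=(2, 1), figsize=(6, 8))

# One subplot per plotted column; all of them share the same figure.
flat_axes = np.ravel(axes)
fig = flat_axes[0].get_figure()
width, height = fig.get_size_inches()
print(len(flat_axes), width, height)
```

Mentioning both arguments in the example text makes it clear that `layout` arranges the subplots while `figsize` sets the size of the figure that contains them.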

Member

@datapythonista left a comment

Looks pretty cool, nice work.

Added some extra comments.


See Also
--------
matplotlib.pyplot.boxplot: Make a box and whisker plot.
Member

I think hist is a good candidate to be listed here. As they somehow represent the same.

>>> df
stratifying_var price demand quartiles
0 19.151945 106.605791 108.416747 0-25%%
1 62.210877 92.265472 123.909605 50-75%%
Member

Does it make sense to use df.head() and avoid having this long output? Or is it relevant?

Contributor

I agree this is probably too long.

Contributor Author

Removed it in the new version.

@@ -1935,52 +1935,134 @@ def plot_series(data, kind='line', ax=None, # Series unique


_shared_docs['boxplot'] = """
Make a box plot from DataFrame column optionally grouped by some columns or
other inputs
Make a box-and-whisker plot from DataFrame column optionally grouped
Member

I think the one liner summary is missing, isn't it?

Contributor Author

Already added it to the next commit.

column : column name or list of names, or vector
Can be any valid input to groupby
Can be any valid input to groupby.
by : string or sequence
Member

should be str instead of string. Also, I think we usually use array-like instead of sequence. I guess they mean the same in this case.

ax : Matplotlib axes object, optional
Column in the DataFrame to groupby.
ax : Matplotlib axes object, (default `None`)
The matplotlib axes to be used by boxplot.
fontsize : int or string
Member

str instead of string

codecov bot commented Mar 11, 2018

Codecov Report

Merging #20152 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20152      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%     
==========================================
  Files         152      152              
  Lines       49255    49255              
==========================================
+ Hits        45226    45240      +14     
+ Misses       4029     4015      -14
Flag Coverage Δ
#multiple 90.23% <ø> (+0.02%) ⬆️
#single 41.9% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/plotting/_core.py 82.5% <ø> (ø) ⬆️
pandas/util/testing.py 84.73% <0%> (+0.2%) ⬆️
pandas/plotting/_converter.py 66.81% <0%> (+1.73%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4efb39f...39bb166.

Contributor Author

mabelvj commented Mar 11, 2018

Updated the documentation and the first comment of the PR with the new screenshots and output of the validation test.

Member

@jorisvandenbossche left a comment

Nice docstring! Added a bunch of comments

All other plotting keyword arguments to be passed to
matplotlib's boxplot function
`matplotlib.pyplot.boxplot <https://matplotlib.org/api/_as_gen/
Member

You can use :func:`matplotlib.pyplot.boxplot` instead of the full link

rot : label rotation angle
column : str or list of str, optional
Column name or list of names, or vector.
Can be any valid input to groupby.
Member

Is this related to groupby? I think it selects which columns to plot?

Contributor Author

Yes, the function _grouped_plot_by_column plots the columns performing a groupby.

Contributor Author

@mabelvj, Mar 15, 2018

Added :meth:`pandas.DataFrame.groupby`.

Column name or list of names, or vector.
Can be any valid input to groupby.
by : str or array-like
Column in the DataFrame to groupby.
Member

Can you say something about what the effect is for the plot?

Can be any valid input to groupby.
by : str or array-like
Column in the DataFrame to groupby.
ax : object of class matplotlib.axes.Axes, default `None`
Member

keeping it as optional is fine. In the explanation below, I would add that if not passed, a new instance is created.

Contributor Author

Ok, thought the same. Will add it.

Tick label font size in points or as a string (e.g., ‘large’)
(see `matplotlib.axes.Axes.tick_params
<https://matplotlib.org/api/_as_gen/
matplotlib.axes.Axes.tick_params.html>`_).
Member

I don't see more information about fontsize on that page?

column : str or list of str, optional
Column name or list of names, or vector.
Can be any valid input to groupby.
by : str or array-like
Member

add ", optional"

Contributor

To be fixed.

rot : int or float, default 0
The rotation angle of labels (in degrees)
with respect to the screen coordinate sytem.
grid : boolean, default `True`
Member

It's not needed to use backticks in the type description


`**kwds` : Keyword Arguments
The size of the figure to create in matplotlib.
layout : tuple (rows, columns) (optional)
Member

(optional) -> , optional

Contributor Author

I wasn't sure how to format it. Changed all to just optional.

``return_type`` is returned (i.e.
``df.boxplot(column=['Col1','Col2'], by='var',return_type='axes')``
may return ``Series([AxesSubplot(..),AxesSubplot(..)],
index=['Col1','Col2'])``).
Member

I would rather add an example in the "Examples" section. As putting it here inline will always be limited

:context: close-figs

>>> df = pd.DataFrame(np.random.rand(10,3),
... columns=['Col1', 'Col2', 'Col3'])
Member

You can re-use the df from the previous code-block, so maybe not needed to re-define here again

Contributor Author

@mabelvj, Mar 18, 2018

I can remove all and use just the first dataframe, but wouldn't it be clearer to define a new dataframe in each code-block so the user has a clear image of the structure and can reproduce the example just by copy-pasting it, without having to look for the origin of the dataframe?

Contributor

I guess it depends, because also it can help to see plots of the exact same data with different variations, to see how the different arguments affect the output plot. Also, reusing the same dataframe can help making the examples section a bit shorter and less verbose.

Your call :)


See Also
--------
matplotlib.pyplot.boxplot: Make a box and whisker plot.
Contributor

I think we need to put a space before the colons here as well.

See Also
--------
matplotlib.pyplot.boxplot: Make a box and whisker plot.
matplotlib.pyplot.hist: Make a hsitogram.
Contributor

hsitogram → histogram

.. plot::
:context: close-figs

>>> df = pd.DataFrame(np.random.rand(10,3),
Contributor

As @jorisvandenbossche comments in #20373 (comment), I agree it's better to draw from a normal distribution here to get more natural boxplots.

>>> df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
>>> boxplot = df.boxplot(by='X')

A list of strings (i.e. ``['X','Y']``) containing can be passed to boxplot
Contributor

This sentence doesn't sound correct, it's missing some noun at "a list of strings containing (what?) can be passed to boxplot".

... columns=['Col1', 'Col2', 'Col3', 'Col4'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])

Boxplots of variables distributions grouped by a third variable values
Contributor

"grouped by the values of a third variable" sounds more natural to me, but I'm not a native English speaker.

@dukebody
Contributor

@mabelvj can you go over the comments in the review? There are some of them that need to be addressed before we can merge this.

Contributor Author

mabelvj commented Mar 18, 2018

Yes, I was reviewing them all before making a commit. Had some doubts with the examples for return_type part.

@dukebody
Contributor

You accidentally added some binary files to your last commit.

@@ -2078,27 +2067,27 @@ def plot_series(data, kind='line', ax=None, # Series unique
:context: close-figs

>>> np.random.seed(1234)
>>> df = pd.DataFrame(np.random.rand(10,4),
>>> df = pd.DataFrame(np.random.randn(10,4),
Contributor

flake8 will demand a space after the comma, so np.random.randn(10, 4)

Contributor Author

It accepts it, no error popped out!

@mabelvj force-pushed the DOC_Fixes_issue_8847_boxplot_docs branch from 11105ac to 7daedd4 on March 18, 2018 13:59
returned by `boxplot`. When ``return_type='axes'`` is selected,
the matplotlib axes on which the boxplot is drawn are returned:

>>> df.boxplot(column=['Col1','Col2'], return_type='axes')
Contributor

IMHO it might be enough with a couple of examples for the return types, perhaps being so exhaustive is too verbose and a bit overkill, since one has to care about the ellipsis and all... The different possibilities are already explained at the parameters section.

Contributor Author

OK. I'll keep just the most complicated case that needed explanation from the parameters definition.

Contributor

FYI, you can do axes = df.boxplot(...). Then the return isn't printed out and the test will pass.

Contributor

If you want to make a point about the return type, you can do

>>> result = df.boxplot(...)
>>> type(result)
<class 'pandas.core.series.Series'>

or something like that.
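
Spelled out as a runnable sketch (the DataFrame, the grouping column `X`, and the `Agg` backend are assumptions made for this example), the idea is to assign the result and inspect its type instead of printing the object itself:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, chosen only for this sketch
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns and one grouping column.
rng = np.random.RandomState(1234)
df = pd.DataFrame(rng.randn(10, 2), columns=["Col1", "Col2"])
df["X"] = ["A"] * 5 + ["B"] * 5

# Assigning the result keeps the Axes repr out of the doctest output.
result = df.boxplot(column=["Col1", "Col2"], by="X", return_type="axes")

# When grouping with ``by``, the result is a Series indexed by column name.
result_type = type(result)
result_index = list(result.index)
print(result_type.__name__, result_index)
```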

Contributor Author

@mabelvj, Mar 21, 2018

Great tip! Didn't know how to do it. The internal structure is not shown but it's enough. Thanks @TomAugspurger!

@mabelvj force-pushed the DOC_Fixes_issue_8847_boxplot_docs branch 2 times, most recently from 8bb9d4d to e9476fc on March 21, 2018 16:21
@TomAugspurger
Contributor

Just to verify, the validate_docstring script passes with your latest (aside from **kwargs)?

lines : dict
ax : matplotlib Axes
(ax, lines): namedtuple
result:
Contributor

This is the one section I'm unsure about. I would maybe reformat it as

keyword : type

so

The return type depends on the `return_type` parameter

* 'axes': a matplotlib.axes.Axes
* 'dict' : a dictionary of ...
* 'both': namedtuple of ...

When grouping the data by `by`, a Series of `return_type` is returned.

And then whatever is going on with return_type=None and by. Sorry this is such a mess, I think it's my fault :)

Contributor Author

What about this?:

    result: type

        The return type depends on the `return_type` parameter:

        * 'axes' : object of class matplotlib.axes.Axes
        * 'dict' : dict of matplotlib.lines.Line2D objects
        * 'both': a namedtuple with structure (ax, lines)

        For data grouped with ``by``:

        * :class:`~pandas.Series`
        * :class:`~numpy.array` (for ``return_type = None``)

Contributor

LGTM! However, I wouldn't write the "type" part after result :, since it doesn't add anything in this case; everything is explained later.
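
The return types listed in this thread can be checked with a short sketch. It assumes pandas, NumPy and matplotlib are installed; the non-interactive `Agg` backend and the seeded sample data are choices made for the illustration, not part of the PR.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, so no window is opened

import matplotlib.axes
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.RandomState(1234)
df = pd.DataFrame(rng.randn(10, 2), columns=["Col1", "Col2"])
df["X"] = ["A"] * 5 + ["B"] * 5
cols = ["Col1", "Col2"]

# Without `by`, the object returned follows `return_type` directly.
ax = df.boxplot(column=cols, return_type="axes")
lines = df.boxplot(column=cols, return_type="dict")
both = df.boxplot(column=cols, return_type="both")

# With `by`, a Series maps each plotted column to the requested object.
grouped = df.boxplot(column=cols, by="X", return_type="axes")

# With `by` and return_type=None, an array of axes is returned.
grouped_none = df.boxplot(column=cols, by="X", return_type=None)

for name, obj in [("axes", ax), ("dict", lines), ("both", both),
                  ("by + axes", grouped), ("by + None", grouped_none)]:
    print(name, type(obj).__name__)

plt.close("all")
```

Printing only the type name avoids the memory addresses and ellipses that made the original doctest output hard to keep stable.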

Member

@datapythonista datapythonista left a comment

Amazing job, added comments about a few minor things.

other inputs
Make a box plot from DataFrame columns.

Make a box-and-whisker plot from DataFrame columns optionally grouped
Member

Maybe a comma after "columns"?

Contributor Author

But the first line should be like a short description and then a whole paragraph going into detail, or not?

Member

Yes, I meant a comma in the extended description, not in the short summary in place of the period: "Make a box-and-whisker plot from DataFrame columns, optionally grouped" :)

Contributor Author

Great! I'll add that.

The box extends from the Q1 to Q3 quartile values of the data,
with a line at the median (Q2).The whiskers extend from the edges
of box to show the range of the data. The position of the whiskers
is set by default to 1.5*IQR (IQR = Q3 - Q1) from the edges of the box.
Member

I'd use backticks for the formula, and use spaces as in Python, so `1.5 * IQR (IQR = Q3 - Q1)`. I think the HTML rendering would look clearer.

Contributor Author

Backticks as code or cursive?

Member

Good point, not sure if it's code. It's your call, I think having a different format from the rest would help understand. Maybe ``IQR * 1.5`` where ``IQR = Q3 - Q1``? ;) That seems like kind of proper Python code. :)

But whatever you like, it was just an idea that I think would make it a bit easier to read.
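
The `1.5 * IQR` rule discussed here can be worked through on a small sample. This is a sketch using only the standard library; `whisker_summary` is a hypothetical helper, `whis` mirrors the name of matplotlib's `boxplot` parameter, and the "inclusive" quantile method is assumed to match the linear interpolation NumPy uses by default. As far as I know, matplotlib draws each whisker to the most extreme data point that still lies inside the fence, which is what the helper reports.

```python
from statistics import quantiles


def whisker_summary(values, whis=1.5):
    """Return quartiles, fences, whisker ends and outliers for a sample."""
    data = sorted(values)
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers stop at the last data point inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


summary = whisker_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```

For this sample the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5, the whiskers end at 1 and 9, and 50 is drawn as an outlier.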

by some other columns. A box plot is a method for graphically depicting
groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data,
with a line at the median (Q2).The whiskers extend from the edges
Member

Missing space after the period.

layout : tuple (rows, columns), optional
For example, (3, 5) will display the subplots
using 3 rows and 5 columns, starting from the top-left.
return_type : {None, 'axes', 'dict', 'both'}, default 'axes'
Member

I think the convention would be {'axes', 'dict', 'both'} or None, default 'axes'.

Contributor Author

@mabelvj mabelvj Mar 25, 2018

Changed it, but isn't it more confusing with one of the possible values outside the dict?

Member

It's a good point, I didn't see it this way. But I think the idea is that None is seen more as a type than as a value in this case. The format would be: {'small', 'large'} or int. So, maybe the "right" way would be {'axes', 'dict', 'both'} or NoneType or the one you had with the None inside. But if I'm not mistaken everywhere else the None is outside as None. And to me it's probably a bit clearer.

See Also
--------
matplotlib.pyplot.boxplot : Make a box and whisker plot.
matplotlib.pyplot.hist : Make a histogram.
Member

If we consider hist a related method, and I agree it is, I'd also add the pandas version here (Series.plot.hist).

For the matplotlib ones, I'd use something like "Matplotlib equivalent boxplot". I find it slightly confusing that the description is more or less the same as this method's, without explaining why.

If we add the pandas hist, maybe we can leave the matplotlib one out, as it should be linked from its page.


>>> np.random.seed(1234)
>>> df = pd.DataFrame(np.random.randn(10,4),
... columns=['Col1', 'Col2', 'Col3', 'Col4'])
Member

Is it a problem with the .. plot:: directive to create df in a previous block? If it's not, in most documentation pages what I've seen is:

  • >>> df = ...
  • >>> df or >>> df.head() if it's long
  • Some explanation of the function, boxplot in this case
  • df.boxplot()

I think it's a bit clearer if the user first sees the data for the example, and then the explanation and the function usage.

If that's a problem with .. plot:: just leave it as it is, it's not a big deal.

Contributor Author

It's been discussed before. I think it is clearer for the user to know which DataFrame is being used, since many types of grouping can be done and it is more explicit this way. For consistency, I took the examples from Pandas Documentation: Plotting with matplotlib: Box-plotting, where a new DataFrame is created for each example.

Member

Fair enough, I missed that discussion, sorry.

@mabelvj mabelvj force-pushed the DOC_Fixes_issue_8847_boxplot_docs branch from 0756028 to 478ed1a Compare March 25, 2018 07:54
@mabelvj mabelvj force-pushed the DOC_Fixes_issue_8847_boxplot_docs branch from 478ed1a to 487352b Compare April 2, 2018 13:31
@mabelvj
Contributor Author

mabelvj commented Apr 2, 2018

@TomAugspurger, no errors aside from **kwds

################################################################################
##################### Docstring (pandas.DataFrame.boxplot) #####################
################################################################################

Make a box plot from DataFrame columns.

Make a box-and-whisker plot from DataFrame columns, optionally grouped
by some other columns. A box plot is a method for graphically depicting
groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data,
with a line at the median (Q2). The whiskers extend from the edges
of the box to show the range of the data. The position of the whiskers
is set by default to `1.5 * IQR (IQR = Q3 - Q1)` from the edges of the box.
Outlier points are those past the end of the whiskers.

For further details see
Wikipedia's entry for `boxplot <https://en.wikipedia.org/wiki/Box_plot>`_.

Parameters
----------
column : str or list of str, optional
    Column name or list of names, or vector.
    Can be any valid input to :meth:`pandas.DataFrame.groupby`.
by : str or array-like, optional
    Column in the DataFrame to :meth:`pandas.DataFrame.groupby`.
    One box-plot will be done per value of columns in `by`.
ax : object of class matplotlib.axes.Axes, optional
    The matplotlib axes to be used by boxplot.
fontsize : float or str
    Tick label font size in points or as a string (e.g., `large`).
rot : int or float, default 0
    The rotation angle of labels (in degrees)
    with respect to the screen coordinate system.
grid : boolean, default True
    Setting this to True will show the grid.
figsize : A tuple (width, height) in inches
    The size of the figure to create in matplotlib.
layout : tuple (rows, columns), optional
    For example, (3, 5) will display the subplots
    using 3 rows and 5 columns, starting from the top-left.
return_type : {'axes', 'dict', 'both'} or None, default 'axes'
    The kind of object to return. The default is ``axes``.

    * 'axes' returns the matplotlib axes the boxplot is drawn on.
    * 'dict' returns a dictionary whose values are the matplotlib
      Lines of the boxplot.
    * 'both' returns a namedtuple with the axes and dict.
    * when grouping with ``by``, a Series mapping columns to
      ``return_type`` is returned.

      If ``return_type`` is `None`, a NumPy array
      of axes with the same shape as ``layout`` is returned.
**kwds : Keyword Arguments, optional
    All other plotting keyword arguments to be passed to
    :func:`matplotlib.pyplot.boxplot`.

Returns
-------
result :

    The return type depends on the `return_type` parameter:

    * 'axes' : object of class matplotlib.axes.Axes
    * 'dict' : dict of matplotlib.lines.Line2D objects
    * 'both' : a namedtuple with structure (ax, lines)

    For data grouped with ``by``:

    * :class:`~pandas.Series`
    * :class:`~numpy.ndarray` (for ``return_type = None``)

See Also
--------
Series.plot.hist: Make a histogram.
matplotlib.pyplot.boxplot : Matplotlib equivalent plot.

Notes
-----
Use ``return_type='dict'`` when you want to tweak the appearance
of the lines after plotting. In this case a dict containing the Lines
making up the boxes, caps, fliers, medians, and whiskers is returned.

Examples
--------

Boxplots can be created for every column in the dataframe
by ``df.boxplot()`` or indicating the columns to be used:

.. plot::
    :context: close-figs

    >>> np.random.seed(1234)
    >>> df = pd.DataFrame(np.random.randn(10,4),
    ...                   columns=['Col1', 'Col2', 'Col3', 'Col4'])
    >>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])

Boxplots of variable distributions grouped by the values of a third
variable can be created using the option ``by``. For instance:

.. plot::
    :context: close-figs

    >>> df = pd.DataFrame(np.random.randn(10,2), columns=['Col1', 'Col2'])
    >>> df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
    >>> boxplot = df.boxplot(by='X')

A list of strings (e.g. ``['X','Y']``) can be passed to boxplot
in order to group the data by combination of the variables in the x-axis:

.. plot::
    :context: close-figs

    >>> df = pd.DataFrame(np.random.randn(10,3),
    ...                   columns=['Col1', 'Col2', 'Col3'])
    >>> df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
    >>> df['Y'] = pd.Series(['A','B','A','B','A','B','A','B','A','B'])
    >>> boxplot = df.boxplot(column=['Col1','Col2'], by=['X','Y'])

The layout of boxplot can be adjusted by giving a tuple to ``layout``:

.. plot::
    :context: close-figs

    >>> df = pd.DataFrame(np.random.randn(10,2), columns=['Col1', 'Col2'])
    >>> df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
    >>> boxplot = df.boxplot(by='X', layout=(2,1))

Additional formatting can be done to the boxplot, like suppressing the grid
(``grid=False``), rotating the labels in the x-axis (e.g. ``rot=45``)
or changing the fontsize (e.g. ``fontsize=15``):

.. plot::
    :context: close-figs

    >>> boxplot = df.boxplot(grid=False, rot=45, fontsize=15)

The parameter ``return_type`` can be used to select the type of element
returned by `boxplot`.  When ``return_type='axes'`` is selected,
the matplotlib axes on which the boxplot is drawn are returned:

    >>> boxplot = df.boxplot(column=['Col1','Col2'], return_type='axes')
    >>> type(boxplot)
    <class 'matplotlib.axes._subplots.AxesSubplot'>

When grouping with ``by``, a Series mapping columns to ``return_type``
is returned:

    >>> boxplot = df.boxplot(column=['Col1','Col2'], by='X',
    ...                      return_type='axes')
    >>> type(boxplot)
    <class 'pandas.core.series.Series'>

If ``return_type`` is `None`, a NumPy array of axes with the same shape
as ``layout`` is returned:

    >>> boxplot = df.boxplot(column=['Col1','Col2'], by='X',
    ...                      return_type=None)
    >>> type(boxplot)
    <class 'numpy.ndarray'>

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	Errors in parameters section
		Parameters {'kwds'} not documented
		Unknown parameters {'**kwds'}
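
The Notes section of the docstring above recommends ``return_type='dict'`` for tweaking lines after plotting. A minimal sketch of that workflow follows; it assumes pandas, NumPy and matplotlib are installed, and the `Agg` backend, the seeded data and the choice of red medians are illustrative only.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, so no window is opened

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.RandomState(1234)
df = pd.DataFrame(rng.randn(10, 3), columns=["Col1", "Col2", "Col3"])

# The dict groups the Line2D artists by role (boxes, caps, medians, ...).
lines = df.boxplot(return_type="dict")

# Restyle the median lines after the plot has been drawn.
for median in lines["medians"]:
    median.set_color("red")
    median.set_linewidth(2)

median_colors = [median.get_color() for median in lines["medians"]]
print(sorted(lines), median_colors)

plt.close("all")
```

One median line is drawn per plotted column, so the list of colors has three entries here.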

@mabelvj
Contributor Author

mabelvj commented Apr 2, 2018

Result:

[Five screenshots of the rendered documentation, captured 2018-04-02]

Contributor

@TomAugspurger TomAugspurger left a comment

Looks really good. Just pushed a commit changing the formatting on some of the examples.

Merging later today.

@jorisvandenbossche jorisvandenbossche merged commit 6eda77e into pandas-dev:master Apr 3, 2018
@jorisvandenbossche
Member

@mabelvj Thanks a lot !!

@mabelvj
Contributor Author

mabelvj commented Apr 3, 2018

Thank you! I've learned a lot!

6 participants